πŸ•ΈοΈ Ada Research Browser

plan.md
← Back

Implementation Plan: Cloud Snapshot Demo Lifecycle

Branch: 008-cloud-snapshot-lifecycle | Date: 2026-02-27 | Spec: spec.md

Input: Feature specification from /specs/008-cloud-snapshot-lifecycle/spec.md

Summary

Add snapshot-based lifecycle management to the Hetzner Cloud demo infrastructure. Instead of provisioning from scratch every time (~25 min), users snapshot a working cluster once and restore from snapshots in under 5 minutes. Four new Bash scripts (demo-cloud-snapshot.sh, demo-cloud-warm.sh, demo-cloud-cool.sh, demo-cloud-health.sh) extend the existing cloud infrastructure tooling. Snapshot restore bypasses Terraform, using hcloud CLI directly with label-based resource tracking. A post-restore Ansible playbook handles hostname fixup after cloud-init. A standalone health check with auto-remediation verifies all services before demos.

Technical Context

Language/Version: Bash (POSIX-compatible with Bash extensions, matching existing scripts), Ansible 2.16+ (post-restore playbook), Python 3.9+ (date parsing helper, inline)

Primary Dependencies: hcloud CLI 1.42+, Terraform 1.7+ (cold-build only), Ansible 2.16+, jq (JSON parsing), openssh-client

Storage: Local JSON manifest file (infra/terraform/snapshot-manifest.json), Hetzner Cloud snapshot storage (remote)

Testing: Manual end-to-end testing against live Hetzner Cloud environment (no unit test framework for Bash scripts β€” follows existing project pattern)

Target Platform: macOS / Linux (developer machines), Docker container (rcd-demo-infra:latest)

Project Type: Infrastructure scripts extending existing IaC project

Performance Goals: Warm-start < 5 min (SC-001), health check < 60 sec (SC-003), snapshot creation < 10 min (SC-002)

Constraints: Must work inside Docker container AND natively; must not modify existing demo playbooks or scenarios; must follow existing script patterns (set -euo pipefail, exit codes, output formatting)

Scale/Scope: 4 new scripts (~200-400 lines each), 1 Ansible playbook (~30 lines), 4 Makefile targets, 1 JSON manifest file
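The "existing script patterns" named in the constraints (set -euo pipefail, exit codes, structured output) can be sketched as a common prologue. The info/warn/error helper names echo the Constitution Check notes; their exact formatting and the require helper are assumptions, not the project's actual code.

```shell
#!/usr/bin/env bash
# Sketch of the shared conventions the new scripts follow.
# Helper names match the plan's notes; formatting details are assumptions.
set -euo pipefail

info()  { printf 'INFO:  %s\n' "$*"; }
warn()  { printf 'WARN:  %s\n' "$*" >&2; }
error() { printf 'ERROR: %s\n' "$*" >&2; exit 1; }

require() {
  # Fail fast if a dependency (hcloud, jq, ...) is missing.
  command -v "$1" >/dev/null 2>&1 || error "missing dependency: $1"
}

require bash
info "dependency check passed"
```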

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

| Principle | Status | Notes |
|-----------|--------|-------|
| I. Plain Language First | PASS | Scripts use clear info/warn/error messages; health check outputs human-readable table |
| II. Data Model as Source of Truth | PASS | Snapshot manifest is single source for set metadata; cloud labels provide API-based discovery |
| III. Compliance as Code | PASS | Feature extends infrastructure tooling, not compliance controls; existing roles unchanged |
| IV. HPC-Aware | N/A | No HPC-specific considerations for snapshot lifecycle |
| V. Multi-Framework | N/A | No compliance framework interactions |
| VI. Audience-Aware Documentation | PASS | quickstart.md provides user-facing guide; contracts define technical interface |
| VII. Idempotent and Auditable | PASS | Health check is read-only and idempotent; snapshot/restore are one-shot operations with clear state transitions |
| VIII. Prefer Established Tools | PASS | Uses hcloud CLI (official Hetzner tool), Ansible (existing stack), jq (standard JSON tool); no custom tooling reinvented |

Gate result: PASS β€” all applicable principles satisfied.

Project Structure

Documentation (this feature)

specs/008-cloud-snapshot-lifecycle/
β”œβ”€β”€ plan.md              # This file
β”œβ”€β”€ spec.md              # Feature specification
β”œβ”€β”€ research.md          # Phase 0: Technical research and decisions
β”œβ”€β”€ data-model.md        # Phase 1: Data model (manifest schema, health report)
β”œβ”€β”€ quickstart.md        # Phase 1: User-facing quick start guide
β”œβ”€β”€ contracts/
β”‚   └── cli-interface.md # Phase 1: CLI command contracts
β”œβ”€β”€ checklists/
β”‚   └── requirements.md  # Spec quality checklist
└── tasks.md             # Phase 2: Implementation tasks (via /speckit.tasks)

Source Code (repository root)

infra/scripts/
β”œβ”€β”€ demo-cloud-snapshot.sh   # NEW: Create/list/delete snapshot sets
β”œβ”€β”€ demo-cloud-warm.sh       # NEW: Restore cluster from snapshots
β”œβ”€β”€ demo-cloud-cool.sh       # NEW: Graceful session wind-down
β”œβ”€β”€ demo-cloud-health.sh     # NEW: Service health check with auto-remediation
β”œβ”€β”€ demo-cloud-up.sh         # MODIFIED: Add snapshot prompt after successful provisioning
β”œβ”€β”€ demo-cloud-down.sh       # UNCHANGED
β”œβ”€β”€ check-ttl.sh             # UNCHANGED (already supports hcloud label queries)
└── docker-run.sh            # UNCHANGED

infra/terraform/
β”œβ”€β”€ snapshot-manifest.json   # NEW: Local snapshot set metadata (gitignored)
β”œβ”€β”€ inventory.yml            # EXISTING: Generated by warm-start (same format)
└── *.tf                     # UNCHANGED

demo/playbooks/
β”œβ”€β”€ post-restore.yml         # NEW: Hostname fixup after snapshot restore
β”œβ”€β”€ provision.yml            # UNCHANGED
└── scenario-*.yml           # UNCHANGED

Makefile                     # MODIFIED: Add demo-warm, demo-cool, demo-snapshot, demo-health targets
.gitignore                   # MODIFIED: Add snapshot-manifest.json

Structure Decision: Extends the existing infra/scripts/ directory with 4 new scripts following the established demo-cloud-*.sh naming pattern. One new Ansible playbook in demo/playbooks/ for post-restore hostname fixup. No new directories created β€” everything fits into existing project structure.
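The local manifest referenced above might take a shape like the following sketch. The field names (sets, label, created, snapshots) and the host keys are illustrative assumptions; the authoritative schema lives in data-model.md.

```shell
# Hypothetical shape of infra/terraform/snapshot-manifest.json.
# Field names and host keys are illustrative; the real schema is in data-model.md.
cat > snapshot-manifest.json <<'EOF'
{
  "sets": [
    {
      "label": "demo-2026-02-27",
      "created": "2026-02-27T10:00:00Z",
      "snapshots": {
        "ipa": 123456789,
        "login": 123456790,
        "compute1": 123456791
      }
    }
  ]
}
EOF

# The scripts would normally query this with jq, e.g.:
#   jq -r '.sets[].label' snapshot-manifest.json
grep -o '"label": "[^"]*"' snapshot-manifest.json
```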

Key Technical Decisions

See research.md for full rationale on each decision.

  1. Bypass Terraform for restore β€” Use hcloud CLI directly. Snapshot-restored clusters are tracked via cloud labels and local manifest, not Terraform state. Avoids state conflicts and simplifies the restore workflow.
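The Terraform-free restore path can be sketched as a single hcloud server create call from a snapshot image, tagged for label-based discovery. The server type, location, and SSH key name below are placeholders, not values from the spec.

```shell
# Sketch: create a server directly from a snapshot image with hcloud,
# tagging it so later discovery/teardown can use label selectors.
# --type, --location, and --ssh-key values are placeholders.
restore_server() {
  local name="$1" image_id="$2" set_label="$3"
  hcloud server create \
    --name "$name" \
    --image "$image_id" \
    --type cx22 \
    --location fsn1 \
    --ssh-key demo-key \
    --label cluster=rcd-demo \
    --label "snapshot-set=${set_label}"
}
```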

  2. Post-restore hostname fixup β€” Cloud-init overwrites FQDN hostnames on boot. A minimal Ansible playbook (post-restore.yml) restores *.demo.lab FQDNs and restarts affected services (~30 seconds).
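The actual fixup is the post-restore.yml Ansible playbook; as a rough shell equivalent of what it does on each node, it amounts to resetting the FQDN and bouncing the services that cached the old hostname. The hostname and service names below are illustrative.

```shell
# Shell equivalent of what post-restore.yml does per node (the real
# implementation is Ansible). Hostname and service list are illustrative.
fix_hostname() {
  local fqdn="$1"; shift
  hostnamectl set-hostname "$fqdn"
  # Restart services that picked up the cloud-init hostname at boot.
  for svc in "$@"; do
    systemctl restart "$svc"
  done
}
```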

  3. Label-based resource discovery β€” All snapshot-restored resources are labeled with cluster=rcd-demo and snapshot-set=<label>. This enables teardown via hcloud selectors and maintains compatibility with existing check-ttl.sh.
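Discovery and teardown via those labels can be sketched with hcloud's selector flag; a minimal version, assuming the label scheme above:

```shell
# Sketch: list and delete snapshot-restored servers via hcloud label
# selectors (-l), matching the cluster=rcd-demo / snapshot-set labels.
list_set_servers() {
  hcloud server list -l "cluster=rcd-demo,snapshot-set=$1" \
    -o noheader -o columns=name
}

teardown_set() {
  list_set_servers "$1" |
    while read -r name; do
      hcloud server delete "$name"
    done
}
```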

  4. Service stop before snapshot β€” Critical services (FreeIPA, Slurm, Wazuh, Munge) are stopped before snapshotting to protect database consistency. Services restart immediately after snapshot creation completes.
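The quiesce sequence can be sketched as stop-snapshot-restart over SSH. The systemd unit names (ipa, slurmctld, wazuh-manager, munge) are plausible mappings for the services listed above, not confirmed by the spec.

```shell
# Sketch: stop stateful services over SSH, snapshot the server with hcloud,
# then restart. Unit names are illustrative mappings, not from the spec.
quiesce_and_snapshot() {
  local host="$1" server_id="$2" set_label="$3"
  local services="ipa slurmctld wazuh-manager munge"
  ssh "$host" "systemctl stop $services"
  hcloud server create-image "$server_id" \
    --type snapshot \
    --description "rcd-demo ${set_label}" \
    --label "snapshot-set=${set_label}"
  ssh "$host" "systemctl start $services"
}
```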

  5. Health check with single-retry remediation β€” On service failure, attempt one systemctl restart, wait 5 seconds, re-check. Report final status. Handles transient post-boot service ordering issues without masking deeper problems.
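The single-retry logic above can be sketched as a generic function; the check and restart commands are parameters here so the retry logic stays independent of systemctl specifics, and the 5-second default matches the decision.

```shell
# Sketch of single-retry remediation: on failure, restart once, wait,
# re-check, and report the final status. Check/restart commands are
# parameters; RETRY_DELAY defaults to the 5 seconds named in the plan.
check_with_retry() {
  local svc="$1" check_cmd="$2" restart_cmd="$3"
  local delay="${RETRY_DELAY:-5}"
  if $check_cmd "$svc"; then
    echo "$svc: OK"
    return 0
  fi
  $restart_cmd "$svc"
  sleep "$delay"
  if $check_cmd "$svc"; then
    echo "$svc: OK (after restart)"
  else
    echo "$svc: FAIL"
    return 1
  fi
}
```

In demo-cloud-health.sh the check would presumably be something like systemctl is-active run over SSH; that wiring is an assumption, not specified here.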

Complexity Tracking

No constitution violations to justify. All design choices use established tools (hcloud CLI, Ansible, Bash, jq) and follow existing project patterns.